Online Algorithms and Out-of-Core Learning for Large-Scale Data Processing

Running computationally heavy procedures such as grid search on a large dataset is expensive.

It is also not unusual to work with datasets that are too large to fit in a computer's memory.

Out-of-core learning

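Out-of-core learning splits the dataset into small batches and fits the classifier incrementally, one batch at a time, so the full dataset never has to be in memory at once.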
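To make the idea concrete, here is a minimal pure-Python sketch of incremental learning: a tiny logistic regression model whose `partial_fit` method takes one stochastic gradient step per example in a batch. The class name `OnlineLogReg`, the learning rate, and the toy data generator are assumptions made for illustration; this is not scikit-learn's implementation.

```python
import math
import random


class OnlineLogReg:
    """Toy logistic regression updated one mini-batch at a time (illustrative only)."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def _prob(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X, y):
        # One SGD step per example; only the current batch is needed in memory.
        for x, target in zip(X, y):
            err = self._prob(x) - target
            self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
            self.b -= self.lr * err
        return self

    def predict(self, X):
        return [1 if self._prob(x) >= 0.5 else 0 for x in X]


def toy_batches(n_batches, batch_size, seed=0):
    """Yield (X, y) batches; the label is 1 when x0 + x1 > 0 (hypothetical data)."""
    rng = random.Random(seed)
    for _ in range(n_batches):
        X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(batch_size)]
        y = [1 if x[0] + x[1] > 0 else 0 for x in X]
        yield X, y


model = OnlineLogReg(n_features=2)
for X_batch, y_batch in toy_batches(n_batches=20, batch_size=50):
    model.partial_fit(X_batch, y_batch)

# Evaluate on a fresh batch drawn with a different seed.
X_new, y_new = next(toy_batches(n_batches=1, batch_size=200, seed=1))
accuracy = sum(p == t for p, t in zip(model.predict(X_new), y_new)) / len(y_new)
print(f"held-out accuracy: {accuracy:.3f}")
```

The model only ever sees 50 examples at a time, which is the property that lets the same pattern scale to data that does not fit in memory.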

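import numpy as np
import re
from nltk.corpus import stopwords

stop = stopwords.words('english')

def tokenizer(text):
    # Remove HTML tags.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) or ;-( before punctuation is stripped.
    # Note: the text is lowercased first, so the D and P alternatives never match.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # Replace non-word characters with spaces, then append the emoticons without the '-' nose.
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    # Split on whitespace and drop English stop words.
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

def stream_docs(path):  # generator
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip the header row
        for line in csv:
            # Each row ends with ",<label>\n", so the label is the second-to-last character.
            text, label = line[:-3], int(line[-2])
            yield text, label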

Testing the stream_docs function

next(stream_docs(path='/Users/csian/Desktop/CP/data_set/movie_data.csv'))

('"at a Saturday matinee in my home town. I went with an older friend (he was about 12) and my mom let me go because she thought the film would be OK (it\'s rated G). I was assaulted by loud music, STRANGE images, no plot and a stubborn refusal to make ANY sense. We left halfway through because we were bored, frustrated and our ears hurt. I saw it 22 years later in a revival theatre. My opinion had changed--it\'s even WORSE! Basically everything I hated about it was still there and the film was VERY 60s...and has dated badly. I got all the little in-jokes...too bad they weren\'t funny. The constant shifts in tone got quickly annoying and there\'s absolutely nothing to get a firm grip on. Some people will love this. I found it frustrating...by the end of the film I felt like throwing something heavy at the screen. Also, all the Monkees songs in this movie SUCK (and I DO like them). For ex-hippies only...or if you\'re stoned. I give this a 1."', 0)
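The slicing in `stream_docs` relies on every row ending in a comma, a single-digit label, and a newline: `line[-2]` is the label and `line[:-3]` is everything before the comma. The sketch below checks this on a small temporary file. The file contents and the header names are made up for illustration, and `stream_docs` is repeated so that the block runs on its own.

```python
import os
import tempfile


def stream_docs(path):
    """Yield (text, label) pairs one line at a time (same logic as above)."""
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip the header row
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label


# Hypothetical two-review file in an assumed "review,sentiment" layout.
rows = 'review,sentiment\n"Great film, loved it",1\n"Dull and far too long",0\n'
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write(rows)

docs = list(stream_docs(tmp.name))
os.remove(tmp.name)

for text, label in docs:
    print(label, text)
```

Commas inside the review text are harmless because the line is never split on commas. A final row without a trailing newline, however, would be sliced incorrectly.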

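def get_minibatch(doc_stream, size):
    # Pull up to `size` documents from the stream; the batch is shorter
    # (or empty) once the stream runs out.
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        pass
    return docs, y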
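A short check of how `get_minibatch` behaves at the end of a stream may help. In the sketch below, `toy_stream` is a hypothetical stand-in for `stream_docs`, and `get_minibatch` is repeated so that the block is self-contained.

```python
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        pass
    return docs, y


def toy_stream(n):
    """Hypothetical stand-in for stream_docs: yields n (text, label) pairs."""
    for i in range(n):
        yield f"doc {i}", i % 2


stream = toy_stream(5)
batch_sizes = []
while True:
    docs, y = get_minibatch(stream, size=2)
    if not docs:  # an empty batch signals that the stream is exhausted
        break
    batch_sizes.append(len(docs))

print(batch_sizes)  # [2, 2, 1]
```

The last batch is simply shorter than requested, and every call after that returns two empty lists, so a caller can use an empty batch as the stop condition.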

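The CountVectorizer class cannot be used for out-of-core learning.

This is because CountVectorizer has to hold the complete vocabulary in memory.

The TfidfVectorizer class likewise needs the feature vectors of the entire training dataset in memory in order to compute the inverse document frequencies.

HashingVectorizer, by contrast, is data-independent: it needs no fitting step and maps tokens to feature columns with the hashing trick, using the MurmurHash3 function.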
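The hashing trick itself fits in a few lines. The sketch below is a toy version and not scikit-learn's code: it uses `hashlib.md5` as a stand-in for MurmurHash3 (which is not in the standard library), it leaves out the refinements of the real class, and the function name `hash_features` is hypothetical.

```python
import hashlib
from collections import Counter


def hash_features(tokens, n_features=2**21):
    """Map tokens to {column index: count} without storing any vocabulary."""
    counts = Counter()
    for token in tokens:
        digest = hashlib.md5(token.encode('utf-8')).digest()
        # Take 4 bytes of the digest as an integer and fold it into the column range.
        index = int.from_bytes(digest[:4], 'little') % n_features
        counts[index] += 1
    return dict(counts)


doc_a = hash_features("a fine film a fine cast".split())
doc_b = hash_features("fine".split())

# The same token always lands in the same column, in any document,
# so no fitting step and no shared dictionary are needed.
fine_col = next(iter(doc_b))
print(doc_a[fine_col], doc_b[fine_col])
```

Because the column index depends only on the token, each mini-batch can be vectorized independently of every other batch.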

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hash tokens produced by the custom tokenizer into 2**21 feature columns.
vect = HashingVectorizer(decode_error='ignore', n_features=2**21, preprocessor=None, tokenizer=tokenizer)

# loss='log' selects logistic regression; scikit-learn 1.1 and later call this 'log_loss'.
clf = SGDClassifier(loss='log', random_state=1, max_iter=1)

doc_stream = stream_docs(path='/Users/csian/Desktop/CP/data_set/movie_data.csv')

Choosing a large number of features in HashingVectorizer lowers the chance of hash collisions, but it also increases the number of weights in the logistic regression model.
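The trade-off can be made visible by counting collisions for a fixed vocabulary at several table sizes. This is a rough sketch under stated assumptions: the vocabulary is synthetic, `hashlib.md5` again stands in for MurmurHash3, and the helper names `bucket` and `count_collisions` are hypothetical.

```python
import hashlib


def bucket(token, n_features):
    """Column index for a token under a toy md5-based hashing scheme."""
    digest = hashlib.md5(token.encode('utf-8')).digest()
    return int.from_bytes(digest[:4], 'little') % n_features


def count_collisions(tokens, n_features):
    """Number of tokens that land in a column already taken by another token."""
    return len(tokens) - len({bucket(t, n_features) for t in tokens})


vocab = [f"word{i}" for i in range(1000)]  # hypothetical vocabulary of unique tokens
for n_features in (2**8, 2**14, 2**21):
    print(n_features, count_collisions(vocab, n_features))
```

With only 256 columns, at least 744 of the 1,000 tokens must share a column, while larger tables leave far fewer collisions. The cost is on the model side: 2**21 columns means 2,097,152 weights, roughly 16 MB when stored as 64-bit floats.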

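import pyprind

pbar = pyprind.ProgBar(45)
classes = np.array([0, 1])

# Train on up to 45 mini-batches of 1,000 documents each.
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    # All class labels must be declared, since a single batch may not contain every class.
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()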

SGDClassifier(loss='log', max_iter=1, random_state=1)

0% [##############################] 100% | ETA: 00:00:00

Total time elapsed: 00:00:20

# Use the next 5,000 documents from the stream as the test set.
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))

Accuracy: 0.873

# Finally, update the model with the test documents as well.
clf = clf.partial_fit(X_test, y_test)

word2vec

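word2vec is a more modern alternative to the bag-of-words (BoW) model; it is an algorithm that Google released in 2013.

It is an unsupervised learning algorithm based on neural networks that automatically learns the relationships between words.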